This notebook includes preliminary attempts to visualize some basic information about the dataset.

Data Messiness

This section details the messiness of our dataset. First, we took a quick look at a few ways that items have been disaggregated.

When we first separated unique measures from one another, we concatenated all of the columns in the dataset that relate to disaggregation. Based on a cursory look, these are some of the breakdowns (note that these categories may not be complete). When we could identify that everyone appeared to be included (e.g., ‘ALLREGIONS’ or ‘BOTHSEX’), we did not count these measures as ‘disaggregated.’

`summarise()` has grouped output by 'Target', 'Indicator', 'SeriesDescription'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'disaggregation'. You can override using the `.groups` argument.

'hctreemap' is deprecated.
Use 'data_to_hierarchical' instead.
See help("Deprecated")

This takes a closer look at the disaggregation above: for each target, grouped by goal, we show how many measures are disaggregated and how many are not.

Finally, the following shows our current progress (with Indonesia) in removing indicators: the row counts of the processed data with and without disaggregated measures.

processedIndo =  read.csv('~/QMSS/G5055_Practicum_Project2/Data/processedIndo.csv')
nrow(processedIndo)
[1] 4230
processedIndo_No_Disagg = read.csv('~/QMSS/G5055_Practicum_Project2/Data/processedIndo-WITHOUT disaggregation.csv')
nrow(processedIndo_No_Disagg)
[1] 1809

Guatemala

We also wanted to look at the same breakdowns for Guatemala.

`summarise()` has grouped output by 'Target', 'Indicator', 'SeriesDescription'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'disaggregation'. You can override using the `.groups` argument.

'hctreemap' is deprecated.
Use 'data_to_hierarchical' instead.
See help("Deprecated")

This takes a closer look at the disaggregation above: for each target, grouped by goal, we show how many measures are disaggregated and how many are not.

---
title: "Practicum SDG Group 2"
output: html_notebook
---

This notebook includes preliminary attempts to visualize some basic information about the dataset. 
```{r install packages, echo = FALSE, eval=TRUE, results='hide',message=FALSE,warning=FALSE}
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
# Install only the packages that are not already installed
pkgs <- c('ggthemes', 'gridExtra', 'plotly', 'listviewer', 'treemap', 'highcharter', 'DT')
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

```{r load libraries, echo = FALSE, eval=TRUE, results='hide',message=FALSE,warning=FALSE}
# data visualizations
library(ggthemes) 
library(ggplot2)
library(forcats)
library(viridis)
library(hrbrthemes)
library(plotly)
library(gridExtra)
library(highcharter)
library(treemap)
library(DT)

# data manipulation
library(magrittr)
library(dplyr)
```

```{r, include=FALSE, echo=FALSE}
## This is a test use case of the dataset. 
# Goal 9: Industry, Innovation, and Infrastructure 
indonesia_time <- read.csv('~/QMSS/G5055_Practicum_Project2/Data/indonesia_indicators_time.csv')

df_9 <- indonesia_time %>% subset(Target == '9.2')
df_9

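# Rename the year columns, then reshape to long format (one row per measure and year)
df_9_pivot <- df_9 %>% rename('2018' = X2018.0, '2019' = X2019.0, '2020' = X2020.0,'2021' = X2021.0)
df_9_pivot <- df_9_pivot %>% tidyr::pivot_longer(cols = '2018':'2021', names_to='year')

# distinct() keeps one row per measure/year/value without leaving `year` as a
# grouping variable, which mutate() would refuse to modify
df_9_pivot <- df_9_pivot %>% distinct(UniqueID, year, value)
df_9_pivot <- df_9_pivot %>% mutate(year = as.numeric(year))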
p <- df_9_pivot %>% 
  ggplot(., aes(x=year, y = value, color = UniqueID )) +
  geom_point(alpha = 0.4) +
  geom_line()+
  theme_ipsum(base_size = 10, axis_title_size = 12) +
  scale_colour_viridis(direction = -1, discrete = T) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank())

ggplotly(p)
```

# Data Messiness 

This section details the messiness of our dataset. First, we took a quick look at a few ways that items have been disaggregated. 

When we first separated unique measures from one another, we concatenated all of the columns in the dataset that relate to disaggregation. Based on a cursory look, these are some of the breakdowns (note that these categories may not be complete). When we could identify that everyone appeared to be included (e.g., 'ALLREGIONS' or 'BOTHSEX'), we did not count these measures as 'disaggregated.' 
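
One detail worth checking in pattern-based classification like this is how `grepl()` treats regular-expression metacharacters: an unescaped `+` is a quantifier, so the pattern `25+` matches any ID containing `25`, not only the literal age label `25+`. The sketch below is illustrative only. The `ids` values are invented, the helper name `classify_disaggregation` is hypothetical, and only two of the categories are included to keep it short.

```r
library(dplyr)

# Invented UniqueID values, for illustration only.
ids <- c("ID1_FEMALE_15-24", "ID2_25+", "ID3_2500_URBAN", "ID4_BOTHSEX")

# Unescaped: '+' means "one or more 5s", so "2500" matches as well.
grepl("25+", ids)

# Escaped: only the literal text "25+" matches.
grepl("25\\+", ids)

# Hypothetical helper: the first matching pattern decides the category.
classify_disaggregation <- function(id) {
  case_when(
    grepl("MALE|FEMALE|15-24|25\\+|15-49", id) ~ "age_sex",
    grepl("URBAN|RURAL", id) ~ "geographic_region",
    TRUE ~ "other/not_disaggregated"
  )
}

classify_disaggregation(ids)
```

The same caution applies to short codes such as the three-letter material abbreviations: `grepl()` matches them anywhere inside an ID, so a spot check of the classified rows is a reasonable safeguard.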

```{r indo tree map, echo=FALSE, eval=TRUE }

indonesia_measures <- indonesia_time %>% group_by(Target,Indicator,SeriesDescription,UniqueID) %>% summarize()

# Classify each measure by the first pattern its UniqueID matches.
# The '+' is escaped so that '25\\+' matches the literal age label "25+".
indonesia_measures <- indonesia_measures %>% mutate(disaggregation = case_when(
  grepl('MALE|FEMALE|15-24|25\\+|15-49', UniqueID) ~ 'age_sex',
  grepl('URBAN|RURAL', UniqueID) ~ 'geographic_region',
  grepl('<5Y|<1Y|<1M', UniqueID) ~ 'time',
  grepl('MIL|CAN|DIA|RES|CAR|NFO|CRO|NMA|WOD|ALP|WCH|PET|BIM|MEO|GBO|NMM|FOF|CLO|OIL|TEX|NMC', UniqueID) ~ 'raw_material',
  grepl('UPPSEC|LOWSEC', UniqueID) ~ 'sector',
  TRUE ~ 'other/not_disaggregated'))

tm <-indonesia_measures %>% group_by(disaggregation) %>% mutate(count = n()) %>% ungroup() %>% group_by(disaggregation,count) %>% summarize() %>%
  treemap(index="disaggregation",
          vSize="count",
        type="index",
        fontsize.labels=c(12, 8), 
        palette = "Blues",
        fontfamily.title = "Arial Narrow",
        fontfamily.labels = "Arial Narrow",
        border.col="white",
        title = 'Measures and Disaggregation'
  )

hctreemap(tm, allowDrillToNode = TRUE, layoutAlgorithm = "squarified") %>%
  hc_title(text = "Disaggregated Data (Indonesia) ") %>%
  hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
                             Number of Measures: {point.value:,.0f}<br>")

```

This takes a closer look at the disaggregation above: for each target, grouped by goal, we show how many measures are disaggregated and how many are not.
```{r indo bar chart, echo=FALSE, eval =TRUE, message=FALSE, warning=FALSE}
# Classify each row, flag it as disaggregated or not, then count rows per indicator and flag.
# The '+' is escaped so that '25\\+' matches the literal age label "25+".
indonesia_disaggregated_indicators <- indonesia_time %>%
  mutate(disaggregation = case_when(
    grepl('MALE|FEMALE|15-24|25\\+|15-49', UniqueID) ~ 'age_sex',
    grepl('URBAN|RURAL', UniqueID) ~ 'geographic_region',
    grepl('<5Y|<1Y|<1M', UniqueID) ~ 'time',
    grepl('MIL|CAN|DIA|RES|CAR|NFO|CRO|NMA|WOD|ALP|WCH|PET|BIM|MEO|GBO|NMM|FOF|CLO|OIL|TEX|NMC', UniqueID) ~ 'raw_material',
    grepl('UPPSEC|LOWSEC', UniqueID) ~ 'sector',
    TRUE ~ 'other/not_disaggregated')) %>%
  mutate(disaggregated_ = ifelse(disaggregation == 'other/not_disaggregated', 'not disaggregated', 'disaggregated')) %>%
  group_by(Indicator, disaggregated_) %>%
  mutate(count_disaggregated = n()) %>%
  ungroup() %>%
  group_by(Target, Indicator, disaggregated_, count_disaggregated)

for(i in 1:17){
print(
  indonesia_disaggregated_indicators %>% subset(Goal==i) %>% 
    ggplot(aes(x=Target,fill=disaggregated_)) +
    geom_bar(position='stack') +  # counts rows per Target; binwidth does not apply to a discrete x
    scale_fill_manual(values = c('disaggregated'='navy','not disaggregated' = 'lightblue'))+
    theme_ipsum(base_size = 12, axis_title_size = 14) +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    labs(title = paste('Goal', i),
         subtitle = "For a number of goals and targets, most measures are disaggregated",
         fill = "Disaggregation", y = "count", x = "")
)
}

```
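
The counts behind bar charts like these can also be tabulated, which makes it easier to check which targets have a majority of disaggregated measures. This is a minimal sketch on invented data: the `toy` data frame stands in for a table with `Goal`, `Target`, and `disaggregated_` columns, and the name `share_by_target` is hypothetical.

```r
library(dplyr)

# Toy stand-in for the classified data; all values are invented.
toy <- data.frame(
  Goal = c(1, 1, 1, 1, 2, 2),
  Target = c("1.1", "1.1", "1.1", "1.2", "2.1", "2.1"),
  disaggregated_ = c("disaggregated", "disaggregated", "not disaggregated",
                     "not disaggregated", "disaggregated", "not disaggregated")
)

# One row per target: total rows, disaggregated rows, and their share.
share_by_target <- toy %>%
  group_by(Goal, Target) %>%
  summarise(
    n_rows = n(),
    n_disaggregated = sum(disaggregated_ == "disaggregated"),
    share_disaggregated = n_disaggregated / n_rows,
    .groups = "drop"
  )

share_by_target
```

Targets with `share_disaggregated` above 0.5 are the ones where disaggregated measures are the majority.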

Finally, the following shows our current progress (with Indonesia) in removing indicators: the row counts of the processed data with and without disaggregated measures.
```{r our processes}
processedIndo =  read.csv('~/QMSS/G5055_Practicum_Project2/Data/processedIndo.csv')
nrow(processedIndo)
processedIndo_No_Disagg = read.csv('~/QMSS/G5055_Practicum_Project2/Data/processedIndo-WITHOUT disaggregation.csv')
nrow(processedIndo_No_Disagg)
```
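
As a rough worked example, the two row counts can be turned into the number and share of rows dropped. The figures 4230 and 1809 are the `nrow()` results printed by an earlier run of this notebook; they count rows in each file, which may not correspond one-to-one with indicators. The variable names are illustrative.

```r
# Row counts printed by nrow() in an earlier run of this notebook.
n_with_disagg <- 4230
n_without_disagg <- 1809

# Rows dropped when disaggregated measures are excluded, and their share.
rows_removed <- n_with_disagg - n_without_disagg
share_removed <- rows_removed / n_with_disagg

rows_removed                   # 2421
round(100 * share_removed, 1)  # 57.2 (percent)
```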



## Guatemala

We also wanted to look at the same breakdowns for Guatemala.

```{r guatemala tree map, echo=FALSE, eval = TRUE }
guatemala_time <- read.csv('~/QMSS/G5055_Practicum_Project2/Data/guatemala_indicators_time.csv')

guatemala_measures <- guatemala_time %>% group_by(Target,Indicator,SeriesDescription,UniqueID) %>% summarize()

# Classify each measure by the first pattern its UniqueID matches.
# The '+' is escaped so that '25\\+' matches the literal age label "25+".
guatemala_measures <- guatemala_measures %>% mutate(disaggregation = case_when(
  grepl('MALE|FEMALE|15-24|25\\+|15-49', UniqueID) ~ 'age_sex',
  grepl('URBAN|RURAL', UniqueID) ~ 'geographic_region',
  grepl('<5Y|<1Y|<1M', UniqueID) ~ 'time',
  grepl('MIL|CAN|DIA|RES|CAR|NFO|CRO|NMA|WOD|ALP|WCH|PET|BIM|MEO|GBO|NMM|FOF|CLO|OIL|TEX|NMC', UniqueID) ~ 'raw_material',
  grepl('UPPSEC|LOWSEC', UniqueID) ~ 'sector',
  TRUE ~ 'other/not_disaggregated'))

tm <-guatemala_measures %>% group_by(disaggregation) %>% mutate(count = n()) %>% ungroup() %>% group_by(disaggregation,count) %>% summarize() %>%
  treemap(index="disaggregation",
          vSize="count",
        type="index",
        fontsize.labels=c(12, 8), 
        palette = "Purples",
        fontfamily.title = "Arial Narrow",
        fontfamily.labels = "Arial Narrow",
        border.col="white",
        title = 'Measures and Disaggregation'
  )

hctreemap(tm, allowDrillToNode = TRUE, layoutAlgorithm = "squarified") %>%
  hc_title(text = "Disaggregated Data (Guatemala)") %>%
  hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
                             Number of Measures: {point.value:,.0f}<br>")

```

This takes a closer look at the disaggregation above: for each target, grouped by goal, we show how many measures are disaggregated and how many are not.
```{r guatemala bar chart, echo=FALSE, eval =TRUE, message=FALSE, warning=FALSE}
# Classify each row, flag it as disaggregated or not, then count rows per indicator and flag.
# The '+' is escaped so that '25\\+' matches the literal age label "25+".
guatemala_disaggregated_indicators <- guatemala_time %>%
  mutate(disaggregation = case_when(
    grepl('MALE|FEMALE|15-24|25\\+|15-49', UniqueID) ~ 'age_sex',
    grepl('URBAN|RURAL', UniqueID) ~ 'geographic_region',
    grepl('<5Y|<1Y|<1M', UniqueID) ~ 'time',
    grepl('MIL|CAN|DIA|RES|CAR|NFO|CRO|NMA|WOD|ALP|WCH|PET|BIM|MEO|GBO|NMM|FOF|CLO|OIL|TEX|NMC', UniqueID) ~ 'raw_material',
    grepl('UPPSEC|LOWSEC', UniqueID) ~ 'sector',
    TRUE ~ 'other/not_disaggregated')) %>%
  mutate(disaggregated_ = ifelse(disaggregation == 'other/not_disaggregated', 'not disaggregated', 'disaggregated')) %>%
  group_by(Indicator, disaggregated_) %>%
  mutate(count_disaggregated = n()) %>%
  ungroup() %>%
  group_by(Target, Indicator, disaggregated_, count_disaggregated)

for(i in 1:17){
print(
  guatemala_disaggregated_indicators %>% subset(Goal==i) %>% 
    ggplot(aes(x=Target,fill=disaggregated_)) +
    geom_bar(position='stack') +  # counts rows per Target; binwidth does not apply to a discrete x
    scale_fill_manual(values = c('disaggregated'='maroon','not disaggregated' = 'lavender'))+
    theme_ipsum(base_size = 12, axis_title_size = 14) +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    labs(title = paste('Goal', i),
         subtitle = "For a number of goals and targets, most measures are disaggregated",
         fill = "Disaggregation", y = "count", x = "")
)
}

```


